SYCL Track Finding, main branch (2024.11.14.) #773
Conversation
@@ -36,7 +36,7 @@ struct apply_interaction_payload {
     * @brief View object to the vector of boolean-like integers describing
     * whether each parameter is live. Has the same size as \ref params_view
     */
    vecmem::data::vector_view<const unsigned int> params_liveness_view;
Is there a reason for this change? CUDA doesn't natively support 8-bit loads, so I'm a bit worried about the performance implications of this. Also, is there a reason to use char and not unsigned char?
We use these values as bools. Using 32 bits where we only need 1 seems very silly. Of course bool doesn't work. 😦 But our convention in the offline code is to use char when we need "boolean information" but bool can't be used.
Also, take this into account: https://github.com/acts-project/traccc/blob/main/device/cuda/src/finding/finding_algorithm.cu#L154
What do you think is actually getting set for this buffer with that operation? 😏 Because it's not 0x1 values in the unsigned int variables...
In any case, I can't see why we shouldn't go for this. Even if NVIDIA always copies at least 16 bits, right now we move 32 bits in all cases, even though we only need 1. If some of the loads are next to each other, this could still win us a little bit.
> What do you think is actually getting set for this buffer with that operation? 😏 Because it's not 0x1 values in the unsigned int variables...

It's setting a non-zero value; I don't see the problem?
> In any case, I can't see why we shouldn't go for this. Even if NVIDIA always copies at least 16 bits, right now we move 32 bits in all cases. Even though we only need 1. If some of the loads are next to each other, this could still win us a little bit.

Think of what happens if four adjacent threads want to write their chars to global memory at the same time.
I seriously do not see why we need to change this to save what boils down to 240 kilobytes...
Logically the code worked. But you can't argue that we're not spending more time memsetting these values, and using more global memory on them, than we need to. 🤔
     unsigned int inline TRACCC_DEVICE getLocalThreadId() const {
         return threadIdx.x;
     }

-    std::size_t inline TRACCC_DEVICE getLocalThreadIdX() const {
+    unsigned int inline TRACCC_DEVICE getLocalThreadIdX() const {
         return threadIdx.x;
     }

-    std::size_t inline TRACCC_DEVICE getGlobalThreadId() const {
+    unsigned int inline TRACCC_DEVICE getGlobalThreadId() const {
         return threadIdx.x + blockIdx.x * blockDim.x;
     }

-    std::size_t inline TRACCC_DEVICE getGlobalThreadIdX() const {
+    unsigned int inline TRACCC_DEVICE getGlobalThreadIdX() const {
         return threadIdx.x + blockIdx.x * blockDim.x;
     }

-    std::size_t inline TRACCC_DEVICE getBlockIdX() const { return blockIdx.x; }
+    unsigned int inline TRACCC_DEVICE getBlockIdX() const { return blockIdx.x; }

-    std::size_t inline TRACCC_DEVICE getBlockDimX() const { return blockDim.x; }
+    unsigned int inline TRACCC_DEVICE getBlockDimX() const {
+        return blockDim.x;
+    }

-    std::size_t inline TRACCC_DEVICE getGridDimX() const { return gridDim.x; }
+    unsigned int inline TRACCC_DEVICE getGridDimX() const { return gridDim.x; }
? Why? 😕
These are all unsigned int-s natively. If we return them as std::size_t, I'll have to add a whole lot of static_cast-s in the kernels to make sure that we wouldn't use unsigned long-s where unsigned int is expected.
As it turns out, our CUDA build does not look for type conversions / narrowings. oneAPI does. So I got a lot of warnings from such things.
Okay, but why not just add the static_casts then? getGlobalThreadIdX() can return a number larger than an unsigned integer can store.
Also I am not sure why the OneAPI compiler even sees a type that is supposed to be used only in CUDA code. That's worrying. 😟
The kernels were/are receiving std::size_t values for the thread ID. And then they use those values in a lot of functions that expect unsigned int values. That's where I was getting the warnings from.
But as long as I'm modifying the signature of the traccc::device:: functions, I thought I'd also have a look at what's going on in the CUDA code... 🤔
         return threadIdx.x + blockIdx.x * blockDim.x;
     }

-    std::size_t inline TRACCC_DEVICE getGlobalThreadIdX() const {
+    unsigned int inline TRACCC_DEVICE getGlobalThreadIdX() const {
🤔 I think it would be fair to put an assertion into this function that this value would not be larger than unsigned int. At the same time, notice that the calculation on line 30 works with unsigned int itself. If the result of that overflows, the std::size_t return type will not do anything to help. 🤔
Unfortunately the situation is that the AMD backend of oneAPI doesn't seem to work with the code. 😦 I now tried a number of incantations, but linking still fails. Of course I've been developing the code with the NVIDIA backend, where it compiles/links fine. (Note that there are no runtime tests in the code just yet.) With the Intel backend one needs to be a bit careful as well, but at least the build can succeed with that one. But with the AMD backend there seem to be some bigger issues. 😢
Some interesting stuff going on. 🤔 After fixing yet one more bug in the common track finding code, I can now get "this far":
Surprisingly, we run out of registers on this piece of code:

    oneapi::dpl::sort_by_key(policy, keys_device.begin(),
                             keys_device.end(),
                             param_ids_device.begin());

Which I'm trying to get some help with in: uxlfoundation/oneDPL#1936
Track finding is hard... 😦
🎆 The unit test now succeeds with the NVIDIA backend! 🎆
But the PR still probably needs to stay open until we switch to oneAPI 2025.0.0 in the CI, to make the Intel-backend build work as well. 🤔
Let me catch people up. The SYCL code seems to be functional finally, now that I fixed the very-very silly mistake that I made while translating it from CUDA. I can now reliably run the CKF unit test on our NVIDIA and AMD GPUs.
I can also run the unit test successfully on a CPU.
But unfortunately the SYCL runtime is having a very hard time with JIT-ing the track finding code for our Intel GPU. 😦
(It is just stuck at this point, spinning the CPU at full tilt.) This unfortunately was to be expected. 😦 Since last week, when I tried to ahead-of-time compile the code specifically for our Intel GPU, the build had to be shut down after 3 hours. (It was still trying to link at that point.) So we'll have to have some heart-to-heart with the Intel developers. But at least the other backends seem functional by now. 🥳
The code doesn't do anything, and is not used by anything yet.
Made the "track liveness" buffers use char instead of unsigned int. Since they store boolean information, char is enough. And it also makes memset(...) do a more expected thing on the buffers. Updated device::make_barcode_sequence not to narrow std::size_t into unsigned int. (Which oneAPI doesn't like.)
Mainly to avoid type conversion / narrowing in the code, but also to simplify it slightly in some places.
Trying to avoid confusion at runtime about which kernel is which.
Just as a convenience method for seeing in the SYCL unit tests which exact device is used.
Taught traccc::sycl::test_queue how to figure out what sort of a queue it is. So that the CKF test could be skipped on OpenCL and Level-0 backends as long as those are still not working.
8155284
to
65fd176
Compare
Replaced by #811.
This is finally a CKF algorithm for SYCL. 🥳
Still some technical stuff to set up, I just wanted to open it already, so people would be aware. 😄 (I'll open at least one PR off of this first, which should go in before this one.)